FAQ

Here you can find a non-exhaustive list of frequently asked questions and their answers.

1. Why do we need another workflow manager?

remotemanager is first and foremost a tool for running exploratory workflows on HPC resources (think submission engine rather than workflow tool). Unsurprisingly, its features overlap with those of many existing workflow managers. The intention of remotemanager is to provide simple building blocks with a minimal learning curve; ideally you should find it intuitive and easy to extend. This eases the work of prototyping and debugging new workflows or processes. Later, you can either run your production calculations with remotemanager, or port your workflow to a more heavyweight system.

2. What protocols are used behind the scenes?

Internally, remotemanager relies on a few basic protocols. These should not require any further installs on either your machine or the remote cluster. Additionally, if you have HPC experience, you may already be familiar with most of them:

  1. SSH is used as a baseline for all communication and commands

  2. rsync is the default method for transferring files (though scp is available in the base package)

  3. json is used to serialise objects by default, though yaml, dill and jsonpickle are available

In the name of reducing the barrier to entry, care has been taken to avoid relying on dependencies which require a complex initial setup.

3. What is the meaning of the dataset filenames?

When running jobs, the primary data transfer method is via files. These files are automatically named (which can be somewhat confusing at first); however, there is a strict naming scheme in place. The most important structure to look for is the dataset UUID. Within filenames, this takes the form of 8 hexadecimal characters. For example, the database file for a dataset could look something like dataset-5f3ea4bc.yaml. The UUID is based on a hash of the function to be run remotely. The exception to this rule is if you specify a name when creating your Dataset - this will then take precedence over the UUID.

On the remote machine, you will find files with names like dataset-5f3ea4bc-runner-1-jobscript.sh. You can inspect these files when troubleshooting to verify that jobscripts were built appropriately, arguments were transferred successfully, and so on.
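
If the UUID-based names become hard to track, naming the Dataset sidesteps them. A minimal sketch (the name keyword follows the behaviour described above; check the Dataset documentation for the exact signature):

from remotemanager import Dataset

def f():
    return 0

# the database file is now named after "myjob"
# rather than the 8 character UUID
ds = Dataset(f, name="myjob")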

4. Where can I see an example of how to define a Computer?

A Computer definition can be extremely basic or quite complex, depending on the requirements of your machine and any extra features you wish to add. There are tutorials available within the docs. It is always worth checking if someone has already created a Computer for your connection: they are transferable via YAML, so you may be able to skip the work (or contribute!).
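
For example, importing a shared Computer from a YAML file might look like the following. This is a hedged sketch: the BaseComputer class and its from_yaml/to_yaml methods are assumptions based on the tutorials, so verify the exact names in the docs:

from remotemanager.connection.computers import BaseComputer  # assumed import path

# load a definition shared by a colleague (hypothetical file name)
computer = BaseComputer.from_yaml("my_cluster.yaml")

# ...or export your own for others to use
computer.to_yaml("my_cluster.yaml")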

5. How does 2FA work?

2FA is a prominent topic in the HPC world, and is becoming more and more common.

remotemanager uses SSH keys as the primary factor, and can also interface with secondary factors via the sshpass utility. See the relevant section for more info.
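
As a rough sketch of what this can look like (the passfile argument is an assumption here; the relevant section covers the supported options):

from remotemanager import URL

# hypothetical: a passfile holds the second factor, which sshpass
# supplies whenever a connection is made
url = URL("user@host", passfile="~/.ssh/factor")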

6. How can I decorate a previously defined function?

Sometimes you may want to run a function remotely that was defined in a preexisting library. To transform it into a Sanzu function retroactively, you can call the decorator directly, like so:

[2]:
from remotemanager.decorators.sanzufunction import SanzuFunction

def f():
    return 0

# now retroactively apply the decorator
f = SanzuFunction(f)
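
Calling the wrapped function then dispatches a remote run and returns the result once it completes, as the serialisation example further down shows:

result = f()  # submits the run, waits, and returns 0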

7. How do I fix rsync errors when running jobs?

By default, rsync is used as the internal Transport. While this allows us to take advantage of some of its features, it has two known issues:

  1. macOS users have an outdated rsync version (2.6.9) by default

  2. rsync has issues when using sshpass

For macOS, if you have the ability to install packages on your machine, you can update your rsync. This should be as simple as brew install rsync. Ensure that rsync --version then reports at least 3.0.0.

If this is not a viable solution, other transport utilities are available; scp, for example. You can assign one to a Dataset just as you would a URL:

[3]:
from remotemanager import Dataset, URL
from remotemanager.transport import scp

def f():
    return

url = URL("user@host")
trn = scp()

ds = Dataset(f, url=url, transport=trn)

Note

More information is available in the relevant section of the documentation.

8. How do I deal with serialisation errors?

When using remotemanager, you may encounter errors like this:

[4]:
from uuid import UUID

@SanzuFunction
def f(x):
    return x
f(UUID(int=16))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/remotemanager/remotemanager/dataset/runner.py:117, in Runner.__init__(self, arguments, dbfile, parent, self_id, extra_files_send, extra_files_recv, verbose, extra, **run_args)
    115 try:
    116     # check that the args can be sent via json
--> 117     self._args = json.loads(json.dumps(arguments))
    118     self._generate_uuid()

File /usr/local/lib/python3.12/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:

File /usr/local/lib/python3.12/json/encoder.py:200, in JSONEncoder.encode(self, o)
    197 # This doesn't pass the iterator directly to ''.join() because the
    198 # exceptions aren't as detailed.  The list call should be roughly
    199 # equivalent to the PySequence_Fast that ''.join() would do.
--> 200 chunks = self.iterencode(o, _one_shot=True)
    201 if not isinstance(chunks, (list, tuple)):

File /usr/local/lib/python3.12/json/encoder.py:258, in JSONEncoder.iterencode(self, o, _one_shot)
    254     _iterencode = _make_iterencode(
    255         markers, self.default, _encoder, self.indent, floatstr,
    256         self.key_separator, self.item_separator, self.sort_keys,
    257         self.skipkeys, _one_shot)
--> 258 return _iterencode(o, 0)

File /usr/local/lib/python3.12/json/encoder.py:180, in JSONEncoder.default(self, o)
    162 """Implement this method in a subclass such that it returns
    163 a serializable object for ``o``, or calls the base implementation
    164 (to raise a ``TypeError``).
   (...)
    178
    179 """
--> 180 raise TypeError(f'Object of type {o.__class__.__name__} '
    181                 f'is not JSON serializable')

TypeError: Object of type UUID is not JSON serializable

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 6
      3 @SanzuFunction
      4 def f(x):
      5     return x
----> 6 f(UUID(int=16))

File ~/remotemanager/remotemanager/decorators/sanzufunction.py:39, in SanzuWrapper.__call__(self, *args, **kwargs)
     35         raise ValueError(f"Got multiple values for arg {argname}")
     37     kwargs[argname] = arg
---> 39 runner = self._ds.append_run(kwargs, return_runner=True)
     40 runner.run()
     42 self._ds.wait(only_runner=runner)

File ~/remotemanager/remotemanager/dataset/dataset.py:729, in Dataset.append_run(self, args, arguments, name, extra_files_send, extra_files_recv, dependency_call, verbose, quiet, skip, force, lazy, chain_run_args, extra, return_runner, **run_args)
    726 else:
    727     r_id = f"runner-{rnum}"
--> 729 tmp = Runner(
    730     arguments=args,
    731     dbfile=self.dbfile,
    732     parent=self,
    733     self_id=r_id,
    734     extra_files_send=extra_files_send,
    735     extra_files_recv=extra_files_recv,
    736     verbose=verbose,
    737     extra=extra,
    738     **run_args,
    739 )
    741 tmp.result_extension = self.serialiser.extension
    743 tmp = self.insert_runner(
    744     runner=tmp,
    745     skip=skip,
   (...)
    749     return_runner=return_runner,
    750 )

File ~/remotemanager/remotemanager/dataset/runner.py:129, in Runner.__init__(self, arguments, dbfile, parent, self_id, extra_files_send, extra_files_recv, verbose, extra, **run_args)
    126 if not os.path.isdir(self.parent.local_dir):
    127     os.makedirs(self.parent.local_dir)
--> 129 content = self.parent.serialiser.dumps(arguments)
    130 with open(lpath, self.serialiser.write_mode) as o:
    131     o.write(content)

File ~/remotemanager/remotemanager/serialisation/serialjson.py:13, in serialjson.dumps(self, obj)
     11 def dumps(self, obj):
     12     obj = self.wrap_to_list(obj)
---> 13     return json.dumps(obj)

File /usr/local/lib/python3.12/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    226 # cached encoder
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:
    233     cls = JSONEncoder

File /usr/local/lib/python3.12/json/encoder.py:200, in JSONEncoder.encode(self, o)
    196         return encode_basestring(o)
    197 # This doesn't pass the iterator directly to ''.join() because the
    198 # exceptions aren't as detailed.  The list call should be roughly
    199 # equivalent to the PySequence_Fast that ''.join() would do.
--> 200 chunks = self.iterencode(o, _one_shot=True)
    201 if not isinstance(chunks, (list, tuple)):
    202     chunks = list(chunks)

File /usr/local/lib/python3.12/json/encoder.py:258, in JSONEncoder.iterencode(self, o, _one_shot)
    253 else:
    254     _iterencode = _make_iterencode(
    255         markers, self.default, _encoder, self.indent, floatstr,
    256         self.key_separator, self.item_separator, self.sort_keys,
    257         self.skipkeys, _one_shot)
--> 258 return _iterencode(o, 0)

File /usr/local/lib/python3.12/json/encoder.py:180, in JSONEncoder.default(self, o)
    161 def default(self, o):
    162     """Implement this method in a subclass such that it returns
    163     a serializable object for ``o``, or calls the base implementation
    164     (to raise a ``TypeError``).
   (...)
    178
    179     """
--> 180     raise TypeError(f'Object of type {o.__class__.__name__} '
    181                     f'is not JSON serializable')

TypeError: Object of type UUID is not JSON serializable

remotemanager uses JSON as the default way to send data to and from the machine. Unfortunately, custom datatypes cannot be serialised this way. To this end, dill- and jsonpickle-based serialisers are provided. You can swap to one of these schemes by importing it from remotemanager.serialisation and passing it at Dataset definition. See the relevant section.

Note: Re-defining your dataset this way will not resubmit any jobs on run().

[5]:
from remotemanager.serialisation import serialjsonpickle
@SanzuFunction(serialiser=serialjsonpickle())
def g(x):
    return x
g(UUID(int=16))
appended run runner-0
Running Dataset
assessing run for runner dataset-f5ec0403-runner-0... running
Transferring 6 Files... Done
Fetching results
Transferring 1 File... Done
[5]:
UUID('00000000-0000-0000-0000-000000000010')

9. My function didn’t work, how do I see the error?

remotemanager attempts to handle errors in the same way as results. If something is raised on the remote side, it will be captured in the errors property of the dataset.

Note that as these behave like results, you may need to call fetch_results() before you can see your error.
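In practice this is just (a minimal sketch, matching the example in the next question):

ds.fetch_results()
print(ds.errors)  # one entry per runner; None for runs that succeeded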

10. My error is missing information, how do I get more?

By default, errors return only the last line of the actual traceback, to improve readability. However, the last line is not always enough, so it can be wise to check the full string. For this, the full_error property exists.

This property is attached to the Runner rather than the Dataset. The easiest way to access this is to use the Dataset.failed property, which will return a list of all failed runs.

[6]:
from remotemanager import Dataset

def f(x):
    # this function will raise an exception
    if x < 0:
        raise RuntimeError("pretend this error is much longer")
    else:
        return x

ds = Dataset(f, skip=False)
ds.append_run({"x": -1});
ds.append_run({"x": 1});
ds.run(); ds.wait(1, 10); ds.fetch_results()

print(ds.errors)
appended run runner-0
appended run runner-1
Running Dataset
assessing run for runner dataset-44c604f9-runner-0... running
assessing run for runner dataset-44c604f9-runner-1... running
Transferring 7 Files... Done
Fetching results
Transferring 1 File... Done
['RuntimeError: pretend this error is much longer', None]
[7]:
for runner in ds.failed:
    print(f"runner: {runner}")
    print("error:")
    print(runner.full_error)
runner: dataset-44c604f9-runner-0
error:
Traceback (most recent call last):
  File "/home/test/remotemanager/docs/source/temp_runner_remote/dataset-44c604f9-runner-0-run.py", line 20, in <module>
    result = repo.f(**kwargs)
             ^^^^^^^^^^^^^^^^
  File "/home/test/remotemanager/docs/source/temp_runner_remote/dataset-44c604f9-repo.py", line 179, in f
    raise RuntimeError("pretend this error is much longer")
RuntimeError: pretend this error is much longer

11. I made a mistake, how do I start over?

When prototyping, it is often easier to fail fast and start over! There are a few methods available to assist with this. The “manual” method is to go into the file system and delete the database file associated with your Dataset. This will be named something like dataset-{8 character UUID}.yaml, unless you named your dataset. If cleaning up by hand is impractical, you can ask the Dataset to do it for you with a hard_reset() call. This will attempt to delete the local database file, the runners, and any associated results/errors.
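
Programmatically, this is a one-liner (a minimal sketch; hard_reset is the call named above):

from remotemanager import Dataset

def f():
    return 0

ds = Dataset(f)

# delete the local database file, the runners,
# and any associated results/errors
ds.hard_reset()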

12. Some of my Runners failed, but not others. What do I do?

When running large datasets, sometimes some runners will fail but not others. In this case, Dataset and run() offer some tools. The simplest case is when the failures are due to a machine or resource issue. We can simulate this by setting up a Dataset that will fail if a file is present.

[9]:
from remotemanager import Dataset

import os

def f(inp, t=0):
    import os

    fail_flag = f"fail_{inp}"
    if os.path.exists(fail_flag):
        raise ValueError(f"found file {fail_flag}, raise error")

    return inp

ds = Dataset(f, skip=False)

ds.append_run({"inp": 1})
ds.append_run({"inp": 2})

try:
    os.makedirs(ds.remote_dir)
except FileExistsError:
    pass

# make runner 2 fail
with open(os.path.join(ds.remote_dir, "fail_2"), "w+") as o:
    o.write("")

ds.run()
ds.wait(1, 10)
ds.fetch_results()
ds.results
appended run runner-0
appended run runner-1
Running Dataset
assessing run for runner dataset-5c161123-runner-0... running
assessing run for runner dataset-5c161123-runner-1... running
Transferring 7 Files... Done
Fetching results
Transferring 1 File... Done
Warning! Found 1 error(s), also check the `errors` property!
[9]:
[1, RunnerFailedError('ValueError: found file fail_2, raise error')]

To rerun only failed runners, you can use Dataset.retry_failed(). This will look for runners that are marked as failed, and run only those.

[10]:
os.remove(os.path.join(ds.remote_dir, "fail_2"))

ds.retry_failed()
ds.wait(1, 10)
ds.fetch_results()
ds.results
Running Dataset
assessing run for runner dataset-5c161123-runner-1... force running
Transferring 5 Files... Done
Fetching results
Transferring 1 File... Done
[10]:
[1, 2]

Alternatively, if a runner fails because of its arguments, the best option is to delete that runner and add a new one with the proper args. See the dedicated Failure Tutorial for more info.
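
As a hypothetical sketch of that pattern, continuing from the Dataset above (the remove_run name and its argument form are assumptions; the Failure Tutorial covers the actual API):

# hypothetical: drop the runner whose arguments were wrong
ds.remove_run({"inp": 2})

# append a corrected replacement and rerun
ds.append_run({"inp": 3})
ds.run()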

13. I updated to version 0.11.x and now my Computers don’t import. What do I do?

Version 0.11.0 changed a lot about how Computers are defined. Updating is a simple process, and should preserve all configurations. See the section regarding this.